Skip to content

chore(depot): fault injection tests#4868

Draft
NathanFlurry wants to merge 27 commits into04-29-feat_sqlite_pitr_forkingfrom
05-01-chore_depot_fault_injection_tests
Draft

chore(depot): fault injection tests#4868
NathanFlurry wants to merge 27 commits into04-29-feat_sqlite_pitr_forkingfrom
05-01-chore_depot_fault_injection_tests

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app
Copy link
Copy Markdown

railway-app Bot commented May 2, 2026

🚅 Deployed to the rivet-pr-4868 environment in rivet-frontend

Service Status Web Updated (UTC)
website 😴 Sleeping (View Logs) Web May 2, 2026 at 1:10 pm
frontend-cloud 😴 Sleeping (View Logs) Web May 2, 2026 at 12:03 pm
kitchen-sink ❌ Build Failed (View Logs) Web May 2, 2026 at 9:22 am
ladle ✅ Success (View Logs) Web May 2, 2026 at 3:44 am
mcp-hub ✅ Success (View Logs) Web May 2, 2026 at 3:42 am
frontend-inspector ❌ Build Failed (View Logs) Web May 2, 2026 at 3:42 am

Copy link
Copy Markdown
Member Author

NathanFlurry commented May 2, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link
Copy Markdown

claude Bot commented May 2, 2026

Code Review: chore(depot): fault injection tests

This PR adds a comprehensive fault injection framework for the depot SQLite storage layer, including the controller API, hook sites across commit/read/compaction/reclaim paths, a compaction test driver, and both simple CI tests and a chaos harness. The overall architecture is well-considered. Here are the findings organized by severity.

Issues

1. wait_reached / pause_until_released has a TOCTOU race on the reached flag.

In controller.rs, wait_reached checks the atomic before arming the notification. If the paused task sets reached and calls notify_waiters() between the load and the notified().await arm, the notification is lost and wait_reached hangs indefinitely. CLAUDE.md states that waiters must arm the notification before re-checking the counter. Fix:

pub async fn wait_reached(&self) {
    loop {
        let notified = self.state.reached_notify.notified();
        if self.state.reached.load(Ordering::SeqCst) {
            return;
        }
        notified.await;
    }
}

The same pattern applies to pause_until_released.

2. DepotCompactionTestDriver::wait_until is a polling loop with a sleep — CLAUDE.md violation.

test_driver.rs uses loop { check(); sleep(25ms) }. CLAUDE.md calls out the loop { check; sleep } pattern as polling and says to use event-driven coordination instead. This is test-only code behind test-faults, which limits the impact, but it should be addressed by exposing a watch::channel from the Gasoline debug API or subscribing to workflow completion events rather than polling history rows.

3. Guard types are not marked #[must_use].

WorkflowFaultControllerGuard, WorkflowColdTierGuard, and ColdObjectDeleteGraceGuard perform cleanup via Drop. If a caller writes register_workflow_fault_controller(...) without binding to a variable, the guard is immediately dropped and the registration is silently undone. DepotFaultRuleBuilder is correctly marked #[must_use]; the guard types should be too.

4. BeforeColdDelete and AfterColdDelete fault points are reused between delete_retired_cold_objects and delete_orphan_cold_objects.

A single rule targeting those points will fire on either path, making test assertions ambiguous when both activities run in the same compaction cycle. Consider adding an OrphanDelete sub-point or an activity-label context dimension so tests can target one path exclusively.

5. ShardCacheFillFaultPoint variants are defined but have no hook sites wired in production code.

The BeforeGetObject, AfterGetObject, BeforeShardWrite, AfterShardWrite, and Skipped variants are exhaustively matched so it compiles, but these variants can never fire. Any test rule targeting them will silently fail assert_expected_fired(). Either wire the hooks in conveyer/read/cache_fill.rs or document them as Phase 5 stubs with a TODO comment.

Code Quality Observations

6. Variable shadowing for output in companion.rs is subtle.

The #[cfg(feature = "test-faults")] block rebinds output by shadowing the pre-existing binding. This compiles correctly because the shadow takes the same type, but an explicit reassignment or an if let Err(...) branch would be clearer.

7. Three Mutex<Vec<...>> process-global statics do linear scans on every access.

WORKFLOW_FAULT_CONTROLLERS, COLD_OBJECT_DELETE_GRACE_OVERRIDES, WORKFLOW_TEST_COLD_TIERS all do linear scans. For test usage with small Vecs this is fine, but scc::HashMap would be the CLAUDE.md-preferred pattern if these statics grow or tests become parallel.

8. Repeated Ok(Some(_)) | Ok(None) => Ok(None) match arms across every fault-output helper.

All *_fault_output functions share the same structure: fire the hook, return None on both Ok cases, convert Err to a failed output. This is a good candidate for a single generic helper to reduce the approximately 15 identical match arms across the diff.

9. delete_cold_fault_output carries deleted_object_keys: Vec<String> by value.

The caller must .clone() keys before calling the fault helper even though keys are needed after the call. The signature could instead borrow the slice and only clone inside the error branch.

Minor / Nit

10. check-production-fault-leaks.sh uses python3 without checking it is on PATH.

Add a guard near the top of the metadata section to produce a clear error in restricted CI environments instead of a confusing python3: command not found.

11. delta_chunks.sort_by_key(|(key, _)| key.clone()) clones each key for comparison.

Use sort_by(|(a, _), (b, _)| a.cmp(b)) to avoid allocations.

12. u32::try_from(expected_idx).unwrap_or(u32::MAX) produces a misleading error on overflow.

An explicit ensure! before the loop with a clear message would be better, though this is practically impossible in normal usage.

Test Coverage Assessment

Coverage is strong for implemented fault points: pre-durable commit faults, ambiguous post-durable commit faults, hot install after shard publish, cold upload-succeeds-publish-fails, and forced reclaim with short grace. Chaos tests are appropriately #[ignore]d with seed-based replay.

Gaps to note for follow-up:

  • ShardCacheFillFaultPoint variants other than BeforeEnqueue are not covered because the hooks are not wired yet (issue 5 above).
  • The anti-mirror smoke test in the VFS harness is worth making CI-visible as a non-ignored test.
  • assert_expected_fired() is called consistently in fault tests — good.

Summary

The most important issue to address before merge is the wait_reached / pause_until_released notification race (1), which can cause hangs in any pause-based test. Adding #[must_use] to the guard types (3) prevents a class of silently-misconfigured tests. The polling wait_until (2) is a CLAUDE.md violation but acceptable if a comment acknowledges it as a known limitation. All other findings are improvements rather than blockers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant